Code property


77b5aaf2826c95c98e5eb4ab830073de-Supplemental-Conference.pdf

Neural Information Processing Systems

A system of regions (also referred to as a network) can comprise multiple disjoint regions that exhibit shared activity patterns across a range of tasks. The auditory system is located in the superior temporal region of the brain. This region uniquely encodes pitch, speech, and music, but is not involved in high-level language comprehension and production [Norman-Haignere et al., 2015, 2019]. In our experiments pertaining to programming language comprehension, we use the activity seen in the auditory system as a negative control. For the Python program comprehension experiment, individual programs were modeled using the period from the onset of the code/sentence problem until the button press. See Fedorenko et al. [2010] for a discussion of the functional localization approach as it pertains to the language network.
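The trial-modeling step described above (onset of the problem until the button press) can be sketched as a simple boxcar regressor. This is an illustrative sketch at 1-second resolution, not the authors' analysis code; a real fMRI design matrix would also convolve this regressor with a hemodynamic response function.

```python
# Hedged sketch: model one trial as a 0/1 boxcar spanning the period from
# code/sentence onset to the button press, sampled at 1-second resolution.

def trial_boxcar(onset, button_press, run_length):
    """Return a 0/1 regressor: 1 while the trial is on screen, else 0."""
    return [1 if onset <= t < button_press else 0 for t in range(run_length)]

# A trial shown from t=3 until the button press at t=7, in a 10-second run.
reg = trial_boxcar(onset=3, button_press=7, run_length=10)
```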



Convergent Representations of Computer Programs in Human and Artificial Neural Networks

Neural Information Processing Systems

What aspects of computer programs are represented by the human brain during comprehension? We leverage brain recordings derived from functional magnetic resonance imaging (fMRI) studies of programmers comprehending Python code to evaluate the properties and code-related information encoded in the neural signal. We first evaluate a selection of static and dynamic code properties, such as abstract syntax tree (AST)-related and runtime-related metrics. Then, to learn whether brain representations encode fine-grained information about computer programs, we train a probe to align brain recordings with representations learned by a suite of ML models. We find that the Multiple Demand and Language systems, two brain systems responsible for very different cognitive tasks, both encode specific code properties and uniquely align with machine-learned representations of code.
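The probing approach mentioned in the abstract amounts to fitting a linear map from a code model's features to neural responses. The following is a minimal self-contained sketch of that idea on synthetic data (not the paper's setup or data): a linear probe trained by gradient descent, checked by its final prediction error.

```python
# Hedged sketch of a linear probe: fit weights mapping "model features"
# to a synthetic "voxel response" and measure the mean squared error.
import random

def fit_linear_probe(features, targets, lr=0.01, steps=500):
    """Least-squares linear probe trained by batch gradient descent."""
    dim = len(features[0])
    w = [0.0] * dim
    n = len(features)
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y in zip(features, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j in range(dim):
                grad[j] += 2 * err * x[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w

def mse(w, features, targets):
    """Mean squared prediction error of the probe."""
    return sum(
        (sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2
        for x, y in zip(features, targets)
    ) / len(features)

# Synthetic data: 50 samples of 3 features with a known linear target.
random.seed(0)
true_w = [1.5, -2.0, 0.5]
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(50)]
y = [sum(wi * xi for wi, xi in zip(true_w, x)) for x in X]

w = fit_linear_probe(X, y)
final_err = mse(w, X, y)
```

A well-fit probe (low held-out error) is evidence that the target information is linearly decodable from the features; the paper applies the same logic with brain recordings as targets.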


A Brain regions

Neural Information Processing Systems

A system of regions (also referred to as a network) can comprise multiple disjoint regions that exhibit shared activity patterns across a range of tasks. The auditory system is located in the superior temporal region of the brain. The voxels were then filtered using gray-matter masking and (for the MD and Language systems) network localization. See Fedorenko et al. [2010] for a discussion of the functional localization approach as it pertains to the language network. For each brain system and each code property or code model, we run a separate MVPA analysis.
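To make the MVPA step concrete, here is a toy illustration of multi-voxel pattern analysis on synthetic data (not the paper's data or pipeline): a nearest-centroid decoder classifying two conditions from voxel patterns under leave-one-out cross-validation.

```python
# Hedged MVPA illustration: decode two conditions from multi-voxel
# activity patterns with a leave-one-out nearest-centroid classifier.

def centroid(patterns):
    """Element-wise mean of a list of voxel patterns."""
    n = len(patterns)
    return [sum(p[i] for p in patterns) / n for i in range(len(patterns[0]))]

def dist2(a, b):
    """Squared Euclidean distance between two patterns."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def loo_accuracy(patterns, labels):
    """Leave-one-out accuracy of a nearest-centroid decoder."""
    correct = 0
    for i, (p, lab) in enumerate(zip(patterns, labels)):
        rest = [(q, l) for j, (q, l) in enumerate(zip(patterns, labels)) if j != i]
        cents = {
            l: centroid([q for q, ql in rest if ql == l])
            for l in set(labels)
        }
        pred = min(cents, key=lambda l: dist2(p, cents[l]))
        correct += pred == lab
    return correct / len(patterns)

# Two clearly separable synthetic conditions ("code" vs "sentence")
# over three voxels; the decoder should classify every held-out trial.
patterns = [[1.0, 1.1, 0.9], [1.2, 0.9, 1.0], [1.1, 1.0, 1.1],
            [-1.0, -0.9, -1.1], [-1.2, -1.0, -0.9], [-0.9, -1.1, -1.0]]
labels = ["code", "code", "code", "sent", "sent", "sent"]
acc = loo_accuracy(patterns, labels)
```

Above-chance decoding accuracy within a brain system is taken as evidence that the system's activity patterns carry information about the decoded property.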




miniCodeProps: a Minimal Benchmark for Proving Code Properties

Lohn, Evan, Welleck, Sean

arXiv.org Artificial Intelligence

Neural networks have shown initial promise in automating mathematical theorem proving in proof assistants such as Lean. The same proof assistants can be used to verify the correctness of code by pairing code with specifications and proofs that the specifications hold. Automating the writing of code, specifications, and proofs could lower the cost of verification, or, ambitiously, enable a machine learning system to output provably correct code. However, it remains unclear whether current neural theorem provers can automatically verify even relatively simple programs. We present miniCodeProps, a benchmark of 177 program specifications in the Lean proof assistant, aimed at the subproblem of automatically generating a proof for a provided program and specification. miniCodeProps contains specifications about simple, self-contained programs (e.g., lists, natural numbers, binary trees) with varied proof difficulty. Despite its simplicity, miniCodeProps is challenging for current LLM-based provers, which succeed in proving about 25 percent of the specifications. We publicly release miniCodeProps as a benchmark for furthering automated theorem proving in the context of formally verified code.
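To illustrate the flavor of such a benchmark entry, here is a hypothetical specification-and-proof pairing in the style of miniCodeProps (a sketch in Lean 4, not an actual entry from the benchmark): a self-contained list-reversal function together with a proved length-preservation property.

```lean
-- Hypothetical miniCodeProps-style pairing (not from the benchmark):
-- a simple, self-contained program plus a specification proved about it.
def myReverse : List α → List α
  | [] => []
  | x :: xs => myReverse xs ++ [x]

-- Specification: reversal preserves list length.
theorem myReverse_length (xs : List α) :
    (myReverse xs).length = xs.length := by
  induction xs with
  | nil => rfl
  | cons x xs ih => simp [myReverse, ih]
```

The benchmark's task is to generate proofs like the one above automatically, given only the program and the stated specification.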


Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond

Shi, Ensheng, Wang, Yanlin, Zhang, Hongyu, Du, Lun, Han, Shi, Zhang, Dongmei, Sun, Hongbin

arXiv.org Artificial Intelligence

Recently, fine-tuning pre-trained code models such as CodeBERT on downstream tasks has achieved great success in many software testing and analysis tasks. While effective and prevalent, fine-tuning the pre-trained parameters incurs a large computational cost. In this paper, we conduct an extensive experimental study to explore what happens to layer-wise pre-trained representations and their encoded code knowledge during fine-tuning. We then propose efficient alternatives to fine-tune the large pre-trained code model based on the above findings. Our experimental study shows that (1) lexical, syntactic, and structural properties of source code are encoded in the lower, intermediate, and higher layers, respectively, while the semantic property spans across the entire model. (2) The process of fine-tuning preserves most of the code properties. Specifically, the basic code properties captured by lower and intermediate layers are still preserved during fine-tuning. Furthermore, we find that only the representations of the top two layers change most during fine-tuning for various downstream tasks. (3) Based on the above findings, we propose Telly to efficiently fine-tune pre-trained code models via layer freezing. Extensive experimental results on five diverse downstream tasks demonstrate that the number of trainable parameters and the corresponding time cost are greatly reduced, while performance is similar or better. A replication package including source code, datasets, and an online appendix is available at: \url{https://github.com/DeepSoftwareAnalytics/Telly}.
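The layer-freezing idea can be sketched in a few lines. This is a hypothetical illustration of the technique (the names and structure are assumptions, not the actual Telly API): gradients are disabled for the lower layers so that only the top layers are updated during fine-tuning.

```python
# Hedged sketch of layer freezing: keep the lower layers of a pre-trained
# model fixed and fine-tune only the top layers.

class Param:
    """Minimal stand-in for a framework tensor with a requires_grad flag."""
    def __init__(self, size):
        self.size = size
        self.requires_grad = True

def freeze_bottom_layers(layers, num_frozen):
    """Disable gradients for the first `num_frozen` layers.

    `layers` is an ordered list of (name, [Param, ...]) pairs, lowest layer
    first. Returns the number of parameters that remain trainable.
    """
    trainable = 0
    for i, (_name, params) in enumerate(layers):
        for p in params:
            p.requires_grad = i >= num_frozen
            if p.requires_grad:
                trainable += p.size
    return trainable

# Toy 4-layer "model" with 100 parameters per layer: freezing the bottom
# two layers halves the trainable parameter count, which is the source of
# the training-cost reduction the paper reports.
model = [(f"layer{i}", [Param(100)]) for i in range(4)]
trainable = freeze_bottom_layers(model, num_frozen=2)
```

Freezing the bottom layers is consistent with the paper's finding that lower and intermediate layers encode basic code properties that change little during fine-tuning, while the top layers change the most.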


Research on attack cases via topic-model analysis and selection of vulnerability candidates from large-scale vulnerability database

#artificialintelligence

Predicting software vulnerability discovery trends can help improve secure deployment of software applications and facilitate backup provisioning, disaster recovery, diversity planning, and maintenance scheduling. Vulnerability discovery models (VDMs) have been studied in the literature as a means to capture the underlying stochastic process. Based on the VDMs, a few vulnerability prediction schemes have been proposed. Unfortunately, all these schemes suffer from the same weaknesses: they require a large amount of historical vulnerability data from a database (hence they are not applicable to a newly released software application), their precision depends on the amount of training data, and they have a significant amount of error in their estimates. In this work, we propose vulnerability scrying, a new paradigm for vulnerability discovery prediction based on code properties.